Abstract


Crowdsourcing provides a popular paradigm for data collection at scale. We study
the problem of selecting subsets of workers from a given worker pool to maximize
the accuracy under a budget constraint. One natural question is whether we should
hire as many workers as the budget allows, or restrict ourselves to a small number of
top-quality workers. By theoretically analyzing the error rate of a typical setting in
crowdsourcing, we frame the worker selection problem as a combinatorial
optimization problem and propose an algorithm to solve it efficiently. Empirical
results on both simulated and real-world datasets show that our algorithm is able
to select a small number of high-quality workers, and performs as well as, and
sometimes even better than, the much larger crowds that the budget allows.


1 Introduction 

The recent rise of the crowdsourcing approach has made it possible to collect large amounts of 
human-labeled data and solve challenging problems that require human intervention at a large scale 
and at a relatively low cost. In micro-task marketplaces such as Amazon Mechanical Turk, the 
requestors can hire large numbers of online crowd workers to complete human intelligence tasks 
(HITs) in a short time and with payment as low as several cents per task. Unfortunately, because 
of the low pay and inexperience of the workers, their labeling qualities are often much lower than 
those of experts. A common solution is to add redundancy, asking many crowd workers to answer 
the same questions, and aggregating their answers; the combined results of the crowds are often 
much better than that of an individual worker, sometimes even as good as that of the experts - a 
phenomenon known as wisdom of crowds. 

However, because the crowd workers often have different reliabilities due to their diverse
backgrounds, it is important to weight their answers properly when aggregating them. A large
body of work has been proposed to deal with the uncertainty and diversity of the workers'
reliabilities; these methods often take the form of weighted majority voting, where the answers of
the majority of the workers are selected, with a weighting scheme that accounts for the importance
of the different workers according to their reliabilities. The workers' reliabilities can be estimated
either using gold standard questions with known answers (e.g., Von Ahn et al., 2008, Liu et al.,
2013), or by statistical methods such as Expectation-Maximization (EM) (see, e.g., Dawid and
Skene, 1979, Whitehill et al., 2009, Karger et al., 2011, Liu et al., 2012, Zhou et al., 2012).

Our work is motivated by a natural question: do more crowd workers necessarily yield better
aggregated results than fewer workers? The idea of wisdom of crowds seems to suggest an affirmative
answer, since "larger crowds should be wiser". From a Bayesian perspective, this would be true if
we had perfect knowledge about the workers' prediction model, and we were able to use an oracle
aggregation procedure that performs exact Bayesian inference. However, in practice, because the
workers' prediction model and reliabilities are never known perfectly, we run the risk of adding
noisy information as we increase the number of workers. In the extreme, there may exist a large
number of "spammers", who submit completely random answers rather than good-faith attempts to
label; adding these spammers would certainly deteriorate the results, unless we are able to identify
them perfectly and assign them zero weights in the label aggregation algorithm. Even if there
exist no extreme spammers, the median-level workers may still decrease the overall accuracy if they
dominate over the small number of high-quality workers. In fact, a recent empirical study (Mannes
et al., 2013) shows that the aggregated results of a small number of (3 to 6) high-quality workers are
often more accurate than those of much larger crowds.

In this work, we study this phenomenon by formulating a worker selection problem under a budget
constraint. Assume we have a pool of workers whose reliabilities have been tested by a small number
of gold standard questions; under a given label aggregation algorithm, we want to select a subset of
workers that maximizes the accuracy, with a budget constraint that the number of workers assigned
per task is no more than K. A naive and commonly used procedure is to simply select the top K
workers that have the highest reliabilities. However, due to the noisy nature of the label aggregation
algorithms (e.g., majority voting or EM), selecting all the K workers does not necessarily give the
best accuracy, and may waste resources. We study this problem under a simple label
aggregation algorithm based on weighted majority voting, and propose a worker selection method
that is able to select fewer (< K) top-ranked workers while achieving almost the same, or even better,
aggregated solutions than the naive method that uses more (all the top K) workers.

Our method is derived by framing the problem into a combinatorial optimization that minimizes 
an upper bound of the error rate, and deriving a globally optimal algorithm that selects a group of 
top-ranked workers that optimize the upper bound of the error rate. We demonstrate the efficiency 
of our algorithm by comprehensive experiments on a number of real-world datasets. 

Related work. There is a large literature on estimating the workers' reliabilities and eliminating
the spammers based on a predefined threshold (see e.g., Raykar and Yu, 2012, Joglekar et al., 2013).
Our work instead focuses on selecting a minimum number of highest-ranked workers while
discarding the others (which are not necessarily spammers). Note that our method has the advantage of
requiring no pre-specified threshold parameters. Our work should also be distinguished from another
line of research on online assignment for crowdsourcing (Chen et al., 2013, Ho et al., 2013, etc.),
which has different objectives and purposes from our work.

Outline. The rest of the paper is organized as follows. We introduce the background and the problem
setting in Section 2. We then formulate the worker selection problem as a combinatorial
optimization problem and derive our algorithm in Section 3. The numerical experiments are presented in
Section 4. We give further discussions in Section 5 and conclude the paper in Section 6.

2 Background and problem setting 

Assume there are $M$ crowd workers and $N$ items (or questions), each with labels from $L$ classes.
For notational convenience, we denote the set of workers by $\Omega = [M]$, the set of items by $[N]$ and the
set of label classes by $[L]$, where we use $[M]$ to denote the set of the first $M$ integers. We assume each
item $j$ is associated with an unknown true label $y_j \in [L]$, $j \in [N]$. We also assume that we have $n$
control (or gold standard) questions, indexed by $j \in [n]$, whose true labels in $[L]$ are known.

When item $j$ is assigned to worker $i$ for labeling, we get a possibly inaccurate answer from the
worker, which we denote by $Z_{ij} \in [L]$. The workers often have different expertise and attitudes, and
hence have different reliabilities. We assume the $i$-th worker labels the items correctly with probability
$w_i$, that is, $w_i = P(Z_{ij} = y_j)$. In addition, assume we have an estimate $\hat w_i$ of each worker's reliability,
which can be obtained either from the workers' performance on the control items, or by
probabilistic inference algorithms like expectation-maximization (EM). With a known reliability
estimate $\hat w_i$, most label aggregation algorithms, including the naive majority voting and EM, can
be written in the form of weighted majority voting,

$$\hat y_j = \arg\max_{k \in [L]} \sum_{i \in S} f(\hat w_i)\, \mathbb{I}(Z_{ij} = k), \tag{1}$$

where $f(\hat w_i)$ is a monotonic weighting function that decides how much the answers of worker $i$
contribute to the voting according to the reliability $\hat w_i$, and $\mathbb{I}(\cdot)$ is the indicator function. For majority
voting, we have $f_{\mathrm{MV}}(\hat w_i) = 1$, which ignores the diversity of the workers and may perform
badly in practice. In contrast, a log-odds weighting function $f_{\log}(\hat w_i) = \mathrm{logit}(\hat w_i) - \mathrm{logit}(1/L)$,
where $\mathrm{logit}(\hat w_i) = \log\frac{\hat w_i}{1 - \hat w_i}$, can be derived using Bayes' rule under a simple model that
assumes uniform error across classes; here $1/L$ is the probability of random guessing among $L$
classes. However, in practice, the log-odds may be overconfident (growing to infinity) when $\hat w_i$ is
close to 1 or 0. A linearized version $f_{\mathrm{linear}}(\hat w_i) = \hat w_i - 1/L$ has better stability, and is simpler for
theoretical analysis (Li et al., 2013).

Note that both $f_{\log}$ and $f_{\mathrm{linear}}$ have properties that are desirable for general weighting functions:
both are monotonically increasing functions of $\hat w_i$, and take the value zero if $\hat w_i = 1/L$ (to exclude the
labels from random guessers); they are both positive if $\hat w_i > 1/L$ (better than random guessers),
and are both negative if $\hat w_i < 1/L$ (worse than random guessers). These common properties make
$f_{\mathrm{linear}}$ and $f_{\log}$ work similarly in practice. But since $f_{\mathrm{linear}}$ is more stable and simpler for theoretical
analysis, we will focus on the linear weighting function $f_{\mathrm{linear}}$ for our further development of the
worker selection problem, that is, the labels are aggregated via (referred to as WMV-linear),

$$\hat y_j = \arg\max_{k \in [L]} \sum_{i \in S} (L \hat w_i - 1)\, \mathbb{I}(Z_{ij} = k). \tag{2}$$

In the next section, we study the worker selection problem and propose an efficient algorithm based 
on the analysis of the WMV-linear aggregation method. 
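
As a rough illustration of the aggregation rules in (1) and (2), the following Python sketch scores each class by the summed weights of the workers who voted for it. The function names, the 0-based class indices, the clipping constant `eps` in the log-odds weight, and the tie-breaking rule (smallest class index) are assumptions made for this example, not part of the method as stated.

```python
import math

def linear_weight(w_hat, num_classes):
    """Linearized weight L * w_hat - 1, as used in WMV-linear (2)."""
    return num_classes * w_hat - 1.0

def log_odds_weight(w_hat, num_classes, eps=1e-6):
    """Log-odds weight logit(w_hat) - logit(1/L).

    Clipping w_hat to [eps, 1 - eps] is an assumption of this sketch; it
    keeps the weight finite when w_hat is exactly 0 or 1.
    """
    w = min(max(w_hat, eps), 1.0 - eps)
    logit = lambda p: math.log(p / (1.0 - p))
    return logit(w) - logit(1.0 / num_classes)

def weighted_majority_vote(labels, w_hat, num_classes, weight_fn=linear_weight):
    """Aggregate one item's labels.

    labels: dict mapping worker id -> label in {0, ..., L-1} (0-based here).
    w_hat:  dict mapping worker id -> estimated reliability.
    Ties are broken in favor of the smallest class index (an assumption).
    """
    scores = [0.0] * num_classes
    for worker, label in labels.items():
        scores[label] += weight_fn(w_hat[worker], num_classes)
    return max(range(num_classes), key=lambda k: scores[k])

# One reliable worker outvotes two mediocre ones under linear weights,
# whereas uniform weights (plain majority voting) follow the majority.
w_hat = {"a": 0.9, "b": 0.6, "c": 0.6}
labels = {"a": 1, "b": 0, "c": 0}
wmv_label = weighted_majority_vote(labels, w_hat, 2)
mv_label = weighted_majority_vote(labels, w_hat, 2, weight_fn=lambda w, L: 1.0)
```

Both weight functions vanish at $\hat w_i = 1/L$, so a worker estimated to be a random guesser contributes nothing to the vote.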


3 Worker selection by combinatorial optimization 


The problem of selecting an optimal set of workers requires predicting the error rate with a given 
worker set, which is unfortunately intractable in general. However, it is convenient to obtain an 
upper bound of the error rate for the linear weighted majority voting. 


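Theorem 1. Given a set $S$ of workers, using the weighted majority voting in (2) with linear weights
$f_{\mathrm{linear}}$ and an unbiased estimator of the reliabilities $\{\hat w_i\}_{i \in S}$ that satisfies $E[\hat w_i] = w_i$. If the
workers' labels are generated independently according to the following probability

$$P(Z_{ij} = l \mid y_j = k) = \begin{cases} w_i & \text{if } l = k, \\ \dfrac{1 - w_i}{L - 1} & \text{if } l \neq k, \end{cases} \tag{3}$$

then we have

$$\frac{1}{N} \sum_{j=1}^{N} P(\hat y_j \neq y_j) \le \exp\left( -\frac{2 F(S)^2}{L^2 (L-1)^2} + \ln(L-1) \right), \tag{4}$$

where

$$F(S) = \frac{1}{\sqrt{|S|}} \sum_{i \in S} (L w_i - 1)^2. \tag{5}$$
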

Remark: (i) Note that the above upper bound depends on the worker set $S$ and their
reliabilities $w_i$ only through the term $F(S)$. In fact, according to the proof in the supplementary, the
term $F(S)$ corresponds to the expected gap between the voting score of the true label $y_j$ (i.e.,
$\sum_{i \in S} f_{\mathrm{linear}}(\hat w_i)\, \mathbb{I}(Z_{ij} = y_j)$) and that of the wrong labels, and hence reflects the confidence of
the weighted majority voting. Therefore, $F(S)$ represents a score function for the worker set $S$: if
$F(S)$ is large, the weighted majority voting is more likely to give a correct prediction.

(ii) The assumption (3) used in Theorem 1 implies a "one-coin" model on the workers' labels, where
the labels are correct with probability $w_i$, and otherwise make mistakes uniformly among the
remaining classes. This is a common assumption to make, especially in theoretical works (see e.g.,
Karger et al. (2011), Ghosh et al. (2011), Joglekar et al. (2013)). It is possible to relax (3) to a
more general "two-coin" model with arbitrary probability $P(Z_{ij} = l \mid y_j = k)$, which, however,
may lead to more complex upper bounds. In our empirical study on various real-world datasets, we
find that $F(S)$ remains an effective score function for worker selection even when the one-coin
assumption does not seem to hold.
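
To make the role of $F(S)$ concrete, here is a minimal Python sketch that evaluates the right-hand side of the bound, assuming (4) reads $\exp(-2F(S)^2/(L^2(L-1)^2) + \ln(L-1))$ with $F(S) = \frac{1}{\sqrt{|S|}}\sum_{i \in S}(Lw_i - 1)^2$. The function names and the example reliabilities are illustrative assumptions.

```python
import math

def score_F(w, num_classes):
    """F(S): sum of (L * w_i - 1)^2 over the set, divided by sqrt(|S|)."""
    return sum((num_classes * wi - 1.0) ** 2 for wi in w) / math.sqrt(len(w))

def error_bound(w, num_classes):
    """Right-hand side of the error-rate bound for true reliabilities w."""
    L = num_classes
    F = score_F(w, L)
    return math.exp(-2.0 * F ** 2 / (L ** 2 * (L - 1) ** 2) + math.log(L - 1))

# Binary labels: three strong workers versus the same three workers plus
# ten workers who are only slightly better than random guessing.
small_crowd = [0.9] * 3
large_crowd = [0.9] * 3 + [0.55] * 10
bound_small = error_bound(small_crowd, 2)
bound_large = error_bound(large_crowd, 2)
```

In this example the bound for the larger crowd is looser than for the three strong workers alone, because the $\sqrt{|S|}$ normalization grows faster than the small contributions $(Lw_i - 1)^2$ of the near-random workers. The bound can be loose (it may exceed 1), so it is best read as a ranking criterion rather than an error estimate.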

Based on (5), it is natural to select the workers by maximizing the term $F(S)$, that is,

$$\arg\max_{S \subseteq \Omega} F(S), \quad \text{s.t. } |S| \le K. \tag{6}$$

Unfortunately, $F(S)$ depends on the workers' true reliabilities $w_i$, which are often unknown. We
instead estimate $F(S)$ based on $\hat w_i$. The following lemma provides an unbiased estimator.

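Lemma 2. Assume $\hat w_i$ is an unbiased estimator of $w_i$ that satisfies $E[\hat w_i] = w_i$, and $\widehat{\mathrm{var}}(\hat w_i)$ is an
unbiased estimator of the variance of $\hat w_i$. Consider

$$\hat F(S) = \frac{1}{\sqrt{|S|}} \sum_{i \in S} G(\hat w_i), \tag{7}$$

where

$$G(\hat w_i) = (L \hat w_i - 1)^2 - L^2\, \widehat{\mathrm{var}}(\hat w_i), \tag{8}$$

then $\hat F(S)$ is an unbiased estimate of $F(S)$.
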

Remark: (i) The first term $(L \hat w_i - 1)^2$ in (8) shows that the workers with $\hat w_i$ close to either 1
or 0 should be encouraged; these workers tend to answer the questions either all correctly or all
wrongly, and hence are "strongly informative" in terms of predicting the true labels. Note that
workers with $\hat w_i = 0$ are strongly informative in that they eliminate one possible value (their
answer) for the true labels. On the other hand, more workers also means more noise, so there is a
term $\sqrt{|S|}$ in the denominator for balancing the signal-to-noise ratio, which encourages hiring
"strong" workers instead of only hiring more workers.


(ii) A simpler estimate of $F(S)$ is to directly plug $\hat w_i$ in for $w_i$ in (5), that is,

$$\hat F_{\mathrm{plug}}(S) = \frac{1}{\sqrt{|S|}} \sum_{i \in S} (L \hat w_i - 1)^2. \tag{9}$$


However, this leads to a biased estimator of $F(S)$ because it omits the variance
term in (8). The variance term is of critical importance: workers with large
uncertainty in their reliabilities should be less favorable than those with a more confident
estimate.


Since Lemma 2 does not specify $\hat w_i$ and $\widehat{\mathrm{var}}(\hat w_i)$, the next theorem provides a concrete example of
$\hat F(S)$, based on which a symmetric confidence interval of $F(S)$ can be constructed.

Theorem 3. Assume a group of workers are tested with $n$ control questions, and let $c_i$ be the number
of correct answers given by worker $i$ on the $n$ control questions. Then an unbiased estimator $\hat w_i$,
with an unbiased estimator of $\mathrm{var}(\hat w_i)$, can be obtained by

$$\hat w_i = \frac{c_i}{n} \quad \text{and} \quad \widehat{\mathrm{var}}(\hat w_i) = \frac{c_i (n - c_i)}{n^2 (n - 1)}. \tag{10}$$

With such Wi and var (wf), the corresponding F(S) in (7) is unbiased and the interval [F(S) — 
ct, F(S) + n( ^~i 1) a] covers F(S) with probability at least 1 — 2e _2 “ for any a > 0. 


Remark: A discussion about the advantage of the unbiasedness of $\hat F(S)$ and the symmetric confidence
interval is deferred to Section 5.
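
The estimators above can be sketched in a few lines of Python, assuming $\hat w_i = c_i/n$, $\widehat{\mathrm{var}}(\hat w_i) = c_i(n - c_i)/(n^2(n-1))$ and $G(\hat w_i) = (L\hat w_i - 1)^2 - L^2\,\widehat{\mathrm{var}}(\hat w_i)$. The function names are hypothetical, and `expected_G` is an added sanity check (not from the paper) that computes the exact expectation of $G$ when the number of correct control answers is Binomial($n$, $w$).

```python
import math

def reliability_estimate(num_correct, n):
    """Return (w_hat, var_hat) from n control questions; requires n >= 2."""
    w_hat = num_correct / n
    var_hat = num_correct * (n - num_correct) / (n ** 2 * (n - 1))
    return w_hat, var_hat

def G(num_correct, n, num_classes):
    """Bias-corrected per-worker score (L*w_hat - 1)^2 - L^2 * var_hat."""
    w_hat, var_hat = reliability_estimate(num_correct, n)
    return (num_classes * w_hat - 1.0) ** 2 - num_classes ** 2 * var_hat

def F_hat(correct_counts, n, num_classes):
    """Estimated score of a worker set given each worker's correct count."""
    total = sum(G(c, n, num_classes) for c in correct_counts)
    return total / math.sqrt(len(correct_counts))

def expected_G(w, n, num_classes):
    """Exact E[G] when num_correct ~ Binomial(n, w)."""
    return sum(
        math.comb(n, c) * w ** c * (1 - w) ** (n - c) * G(c, n, num_classes)
        for c in range(n + 1)
    )

# Unbiasedness check: E[G] should match (L*w - 1)^2 for the true w.
gap = abs(expected_G(0.7, 10, 4) - (4 * 0.7 - 1) ** 2)
```

Note that a worker who answers half of ten binary control questions correctly gets a negative score `G(5, 10, 2)`, so adding that worker lowers $\hat F(S)$.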

Based on the estimate of $F(S)$ in Theorem 3, the optimization problem is rewritten as

$$\arg\max_{S \subseteq \Omega} \hat F(S), \quad \text{s.t. } |S| \le K, \tag{11}$$


where $\hat F(S)$ is defined in (7). Although this combinatorial problem is neither sub-modular nor
super-modular, we show it can be solved exactly with the linearithmic-time algorithm shown in Algorithm 1.

Algorithm 1 proceeds by ranking the workers according to $G(\hat w_i)$ in decreasing order,
sequentially evaluating the groups of the top-ranked workers, and then finding the smallest group that
has the maximal score $\hat F(S)$. The time complexity of Algorithm 1 is $O(|\Omega| \log |\Omega|)$ and the space
complexity is $O(|\Omega|)$, where $\Omega$ is the whole set of workers.

The following theorem shows that Algorithm 1 achieves the global optimality of (11). 

Theorem 4. For any fixed $\{\hat w_i\}_{i \in \Omega}$, the set $S^*$ given by Algorithm 1 is a global optimum of
Problem (11), that is, we have $\hat F(S^*) \ge \hat F(S)$ for all $S \subseteq \Omega$ that satisfy $|S| \le K$.
Algorithm 1 Worker selection algorithm

Input: Worker pool $\Omega = \{1, 2, \ldots, M\}$ and estimated reliabilities $\{\hat w_i\}_{i \in \Omega}$ from $n$ control
    questions; number of label classes $L$; cardinality constraint: no more than $K$ workers per item.
$x_i \leftarrow G(\hat w_i)$, $\forall i \in \Omega$, as in (8), and sort $\{x_i\}_{i \in \Omega}$ in descending order so that
    $x_{\sigma(1)} \ge x_{\sigma(2)} \ge \cdots \ge x_{\sigma(M)}$, where $\sigma$ is a permutation of $\{1, 2, \ldots, M\}$.
$B \leftarrow \min(K, M)$, $g_1 \leftarrow x_{\sigma(1)}$ and $F_1 \leftarrow g_1$.
for $k$ from 2 to $B$ do
    $g_k \leftarrow g_{k-1} + x_{\sigma(k)}$ and $F_k \leftarrow g_k / \sqrt{k}$
end for
$k^* \leftarrow \min \{ \arg\max_{1 \le k \le B} F_k \}$.
Output: The selected subset of workers $S^* \leftarrow \{\sigma(1), \sigma(2), \ldots, \sigma(k^*)\}$.



Remark: As a generalization, consider the following multiple-objective optimization problem,

$$\arg\max_{S \subseteq \Omega} \left( \hat F(S),\ -|S| \right), \quad \text{s.t. } |S| \le K,$$

which simultaneously maximizes the score $\hat F(S)$ and minimizes the number $|S|$ of workers actually
deployed. We can show that $S^*$ is in fact a Pareto optimal solution in the sense that there exists no
other feasible $S$ that improves over $S^*$ in terms of both $\hat F(S)$ and $|S|$ (details in supplementary).
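
The selection procedure can be sketched in Python as follows. This is a minimal illustration that assumes the per-worker scores $G(\hat w_i)$ have already been computed and that the prefix score is the cumulative sum divided by $\sqrt{k}$; the function name, the dictionary-based input, and the example scores are hypothetical.

```python
import math

def select_workers(g_scores, budget):
    """Pick the smallest top-ranked prefix that maximizes sum(G) / sqrt(k).

    g_scores: dict mapping worker id -> G(w_hat_i).
    budget:   maximum number of workers K (assumed >= 1).
    Returns (selected worker ids, best score).
    """
    # Rank workers by G in descending order (sorting dominates the cost).
    ranked = sorted(g_scores, key=g_scores.get, reverse=True)
    limit = min(budget, len(ranked))

    best_k, best_score, running = 0, -math.inf, 0.0
    for k in range(1, limit + 1):
        running += g_scores[ranked[k - 1]]
        score = running / math.sqrt(k)
        # Strict inequality keeps the smallest k among the maximizers.
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k], best_score

# Hypothetical scores: two strong workers, one weak, one noisy.
g = {"a": 1.0, "b": 0.64, "c": 0.04, "d": -0.11}
selected, best = select_workers(g, budget=4)
```

With these scores the prefix values are 1.0, about 1.16, about 0.97 and about 0.79, so only the two strongest workers are kept even though the budget allows four.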

4 Experimental results 

We demonstrate our algorithm using empirical experiments based on both simulated and real-world
datasets. The empirical results confirm our intuition: selecting a small number of top-ranked
workers may perform as well as, or even better than, using all the available workers. In particular, we
show that our worker selection algorithm significantly outperforms the naive procedure that uses all
the top K workers. We find that our algorithm tends to select a very small number of workers (fewer
than 10 in all our experiments), which is very close to the optimal number of the top-ranked workers
in practice.

To be specific, we consider the following practical scenario in the experiments: (i) Assume there is a
worker pool $\Omega$ where each worker has completed a "qualifying exam" with $n$ control questions, which
is required by either the platform or a particular task owner. (ii) The task owner selects a subset of
workers from $\Omega$ using a worker selection algorithm such as Algorithm 1, based on their performance
on the qualifying exam. (iii) The selected workers are assigned to answer the $N$ questions of
main interest. (iv) Label aggregation algorithms such as WMV-linear or EM are applied to predict
the final labels of these $N$ items.

Even though our worker selection algorithm is derived for WMV-linear, we can still use
other label aggregation algorithms such as EM once the worker set is selected. This gives the
following combinations of algorithms that we test: WMV-linear on the top K workers (WMV-lin
top K), WMV-linear on the worker set $S^*$ selected by Algorithm 1 (WMV-lin selected),
WMV with log-ratio weights on the selected worker set $S^*$ (WMV-log selected), the EM
algorithm on randomly selected K workers (referred to as EM random K), EM on the top K
ranked workers (EM top K) and EM on the selected worker set $S^*$ (EM selected). We also implement
the worker selection algorithm based on the plugin estimator in (9) (which is the same as Algorithm
1, except for replacing $G(\hat w_i)$ with $(L \hat w_i - 1)^2$), followed by a WMV-linear aggregation algorithm
(referred to as WMV-lin plugin). Since majority voting tends to perform much worse than all the
other algorithms, we omit it from the plots for clarity.

In each trial of the algorithms on both the simulated and real-world datasets, 10 items are randomly
picked from the collected data as the control items, and the workers' reliabilities $\{\hat w_i\}_{i \in \Omega}$ are
estimated based on the accuracy on the control items as in (10). In each trial, the number of workers
selected by Algorithm 1 is recorded, and the average number of workers is computed for each
budget K. We terminate all the iterative algorithms at a maximum of 100 iterations. All results are
averaged over 100 random trials.


4.1 Simulated data 
Figure 1: Performance of different worker selection methods on simulated data. WMV-linear
aggregation is used in all the cases. We simulated 31 workers and 1000 items with binary labels, and use
10 control questions. The workers' reliabilities are drawn independently from Beta(2.3, 2). (a) The
accuracies when the budget K varies. (b) The actual number of workers used by different worker
selection methods when K increases.


We generate the simulated data by drawing 31 workers with reliability $w_i$ from Beta(2.3, 2), and
we randomly generate 1000 items with true labels uniformly distributed on $\{\pm 1\}$. The budget K
varies from 3 to 31. Figure 1(a) shows the accuracy of WMV-linear with different worker selection
strategies as the budget K changes. We can see that WMV-lin selected dominates the other
methods. Figure 1(b) shows the actual number of workers selected by the worker selection algorithm
(Algorithm 1). WMV-linear based on our selected workers uses a relatively small number of (always
< 10) workers (the red curve in Figure 1(b)), and achieves even better performance than WMV-lin
top K, which uses the entire available budget (the blue line in Figure 1(b)). We find that the worker
selection algorithm based on the plugin estimator $\hat F_{\mathrm{plug}}(S)$ tends to select slightly more workers, but
achieves slightly worse performance than Algorithm 1 based on the bias-corrected estimator $\hat F(S)$
(see WMV-lin selected vs. WMV-lin plugin in Figure 1(a)). This implies the importance of
the variance term in (8), which penalizes the workers with noisy reliability estimates.
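
A single trial of this kind of simulation might be sketched as below. This is not the authors' experiment code: it assumes the one-coin model for binary labels in $\{\pm 1\}$, draws control items at random from the simulated items, evaluates accuracy on the remaining items only, and breaks voting ties in favor of $+1$. The function name `simulate` and all default parameters are choices made for this illustration.

```python
import math
import random

def simulate(num_workers=31, num_items=1000, n_control=10, budget=20, seed=0):
    rng = random.Random(seed)
    w = [rng.betavariate(2.3, 2.0) for _ in range(num_workers)]
    truth = [rng.choice((-1, 1)) for _ in range(num_items)]
    # One-coin model: worker i reports the true label with probability w[i].
    Z = [[y if rng.random() < wi else -y for y in truth] for wi in w]

    # Estimate reliabilities from randomly chosen control items.
    control = set(rng.sample(range(num_items), n_control))
    c = [sum(Z[i][j] == truth[j] for j in control) for i in range(num_workers)]
    w_hat = [ci / n_control for ci in c]
    # Bias-corrected score G with L = 2 classes.
    G = [(2 * wh - 1) ** 2
         - 4 * ci * (n_control - ci) / (n_control ** 2 * (n_control - 1))
         for wh, ci in zip(w_hat, c)]

    def accuracy(workers):
        # For labels in {-1, +1}, WMV-linear reduces to the sign of the
        # weighted sum of votes; ties go to +1 (an assumption).
        test_items = [j for j in range(num_items) if j not in control]
        hits = 0
        for j in test_items:
            score = sum((2 * w_hat[i] - 1) * Z[i][j] for i in workers)
            hits += (1 if score >= 0 else -1) == truth[j]
        return hits / len(test_items)

    # Naive baseline: the top-K workers by estimated reliability.
    top_k = sorted(range(num_workers), key=lambda i: w_hat[i], reverse=True)[:budget]

    # Selection: smallest prefix (ranked by G) maximizing sum(G) / sqrt(k).
    ranked = sorted(range(num_workers), key=lambda i: G[i], reverse=True)[:budget]
    best_k, best, run = 0, -math.inf, 0.0
    for k, i in enumerate(ranked, start=1):
        run += G[i]
        if run / math.sqrt(k) > best:
            best_k, best = k, run / math.sqrt(k)
    selected = ranked[:best_k]

    return {"top_k": (len(top_k), accuracy(top_k)),
            "selected": (len(selected), accuracy(selected))}

result = simulate(num_items=200, seed=1)
```

Each entry of `result` is a pair (number of workers used, accuracy). Averaging such trials over many seeds and budgets would give curves comparable in spirit to Figure 1, though the exact numbers depend on the assumptions listed above.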

The number of control questions $n$ controls the variance of the reliability estimate $\hat w_i$, and hence
influences the results of the worker selection algorithms. Figure 2(a) shows the results when we vary
$n$ from 3 to 45, with the budget fixed at K = 20. We see that the performance of all the algorithms
increases when $n$ increases, because we have more accurate information about the workers' true
reliabilities, and can make better decisions both when choosing the top K workers and when selecting
workers by Algorithm 1. In addition, when $n$ increases, the variance of $\hat w_i$ decreases and the difference
between WMV-lin selected and WMV-lin plugin decreases.

Figure 2(b) shows the results when we vary the prior parameter $a$, where $w_i \sim \mathrm{Beta}(a, 2)$,
with K = 20 and n = 10 fixed. A larger $a$ means the workers are more likely to have high reliabilities
(i.e., close to 1). We see from Figure 2(b) that the accuracy of WMV-lin top K increases as $a$ increases, due to
the overall improvement of the reliabilities of the top K workers. The performance of WMV-lin
selected and WMV-lin plugin improves only slightly, probably because they select only
several top workers, who are not heavily affected by $a$.


[Figure 2 plots: accuracy versus (a) n, the number of control items, and (b) a, the parameter of the worker reliability prior Beta(a, 2). Curves: WMV-lin top K, WMV-lin plugin, WMV-lin selected.]


Figure 2: Performance of different worker selection methods (a) when changing the number of
control questions n, and (b) when changing the parameter a in the reliability prior Beta(a, 2). The
budget K is fixed at 20. We use the WMV-linear aggregation method in all the cases.


4.2 Real data 

We test the different worker selection methods on three real-world datasets: two collected by
ourselves from the crowdsourcing platform Clickworkers¹, and one by Welinder et al. (2010) from
Amazon Mechanical Turk.

Crowd-test dataset: In this dataset, 31 workers are asked to answer 75 knowledge-based questions
from allthetests.com, which cover topics such as science, math, common knowledge, sports,
geography, U.S. history and politics, and India. All these questions have 4 options, and we know all
the ground truth beforehand. We required each worker to finish all the questions. A typical example
of the knowledge-test question is as follows:

(Question): In what year was the Internet created?

(Options): A. 1951; B. 1969; C. 1985; D. 1993.

Figure 3(a) shows the performance of the different methods as the budget K changes. Since EM
is widely used in practice, we include the results when using it as the label aggregation algorithm
after the workers are selected. We find that the performance of EM top K first increases when
K is small and then decreases when K is large enough (> 10 in this case). Our worker selection
algorithm selects a much smaller number of workers while achieving much better performance than
the top K and random selection methods.

Disambiguity dataset: The task here is to identify which Wikipedia page (among 4 possible options)
a given highlighted entity in a sentence actually refers to. We collected 50 such questions in the
technology domain with ground truth available, and hired 35 workers through Clickworkers, each of
whom was required to complete all the questions. A typical example is as follows:

(Question): "The Microsoft .NET Framework 4 redistributable package installs the .NET Framework
runtime and associated files that are required to run and develop applications to target the .NET
Framework 4". Which Wiki page does "runtime" refer to?

(Options):

A. http://en.wikipedia.org/wiki/Run-time_system

B. http://en.wikipedia.org/wiki/Runtime_library

C. http://en.wikipedia.org/wiki/Run_time_(program_lifecycle_phase)

D. http://en.wikipedia.org/wiki/Run_Time_Infrastructure_(simulation)


¹ http://www.clickworker.com/en
Figure 3: Crowd-test data. 10 items were randomly selected as control. (a) Performance curve of
algorithms with K increasing. (b) Number of workers the algorithms actually used for each K.


Bluebird dataset: It is collected by Welinder et al. (2010) and is publicly available. In this dataset,
39 workers are asked whether a presented image contains an Indigo Bunting or a Blue Grosbeak. There are 108
images in total.
Figure 4: More performance comparison on real-world datasets. (a) The disambiguity dataset: 35
workers and 50 questions in total. (b) The bluebird dataset: 39 workers and 108 questions in total.
The settings are the same as those of Figure 3(a). The numbers of workers actually used (similar to
Figure 3(b)) are plotted in the supplementary.


Figure 4(a) and (b) show the performance of the different algorithms on the disambiguation and the
bluebird dataset, respectively. The results are similar to those in Figure 3(a). For the
disambiguation dataset, the number of workers selected is usually no more than 6, and the corresponding
number for the bluebird dataset is 9. See the supplementary for the plots of the number of workers
the algorithms actually used (similar to Figure 3(b)) for each K on these two datasets.

Note that WMV-lin selected, WMV-log selected and EM selected are based on the
workers selected by Algorithm 1. They achieve better performance than EM based on the top K or
the randomly selected workers when K is large. This shows that aggregation based on inputs from
selected workers not only saves budget but also maintains good performance.
5 Discussion


What is the advantage of ensuring that $\hat F(S)$ is an unbiased estimate of $F(S)$? The true objective
function $F(S)$ is unknown, and we can only optimize over a random estimate $\hat F(S)$. If $\hat F(S)$
is a biased estimator and the bias depends on $\{\hat w_i\}_{i \in \Omega}$, then the optimum solution may be very
different from the underlying true solution. With the unbiased estimator and the symmetric
confidence interval guarantee shown in Lemma 2 and Theorem 3, optimizing $\hat F(S)$ is equivalent to optimizing a
proper confidence bound, because the margin in the confidence interval often does not depend on
the workers' reliabilities. The results in Figure 1 confirm that with the unbiased estimator $\hat F(S)$, the
performance of WMV on the selected workers is better than that with the biased plugin estimator
$\hat F_{\mathrm{plug}}(S)$.

Why does WMV-linear perform better than WMV with $f_{\log}$? In some of our empirical results (e.g.,
Figure 4), we find that WMV with the log-ratio weight is not as good as the one with the linear weight.
This is mainly because there is a high chance that some workers get an estimated reliability $\hat w_i$ close to
0 or 1 when the number of control questions is small (e.g., n = 10). Even if we truncate to
prevent a weight $f_{\log}(\hat w_i)$ from going to $\infty$, the large weights of some workers may still lead to
unstable aggregations. However, the performance of WMV with $f_{\log}$ improves when we use a larger
$n$ or heavier truncation on $\hat w_i$.

Why does EM with top-K workers perform poorly as K increases? Within the given pool of
workers, we add increasingly less reliable workers (compared with the workers already selected) as K
increases; these less reliable workers may confuse the EM algorithm, degrading both the reliability
estimates and the final prediction accuracy. This intuition matches our empirical results in
Figures 3 and 4: the performance of EM generally first increases when K is small (with increasingly
more top-quality workers), but then decreases when K is large (as more and more less reliable workers are
added).

6 Conclusion 

In this paper, we study the problem of selecting a set of crowd workers to achieve the best
accuracy for crowdsourcing labeling tasks. We demonstrate that our worker selection algorithm can
simultaneously reduce the number of selected workers and the prediction error rate,
performing well in terms of both cost and accuracy. For future directions, we are interested in
developing better selection algorithms based on more advanced label aggregation algorithms such
as EM, or more complex probabilistic models.


References 

L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-based
character recognition via web security measures. Science, 321(5895):1465-1468, 2008.

Q. Liu, A. Ihler, and M. Steyvers. Scoring workers in crowdsourcing: How many control questions 
are enough? In NIPS, 2013. 

A.P. Dawid and A.M. Skene. Maximum likelihood estimation of observer error-rates using the EM
algorithm. Journal of the Royal Statistical Society, 28(1):20-28, 1979.

J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: 
Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009. 

D.R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS,
2011.

Q. Liu, J. Peng, and A. Ihler. Variational inference for crowdsourcing. In NIPS, 2012. 

D. Zhou, J. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. 
In NIPS, 2012. 

A. E. Mannes, J. B. Soll, and R. P. Larrick. The wisdom of small crowds, 2013. URL
https://faculty.fuqua.duke.edu/~jsoll/.

V.C. Raykar and S. Yu. Eliminating spammers and ranking annotators for crowdsourced labeling 
tasks. The Journal of Machine Learning Research, 13:491-518, 2012. 




M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Evaluating the crowd with confidence. In 
SIGKDD, 2013. 

X. Chen, Q. Lin, and D. Zhou. Optimistic knowledge gradient policy for optimal budget allocation
in crowdsourcing. In ICML, 2013.

C. Ho, S. Jabbari, and J.W. Vaughan. Adaptive task assignment for crowdsourced classification. In 
ICML, 2013. 

H.W. Li, B. Yu, and D. Zhou. Error rate bounds in crowdsourcing models. arXiv preprint 
arXiv:1307.2674, 2013. 

A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection
in user-generated content. In ACM conference on Electronic commerce, pages 167-176. ACM,
2011.

P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In 
NIPS, 2010. 

Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How to grade a test without knowing the answers 
— a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In ICML, 2012. 

Y. Yan, R. Rosales, G. Fung, and J.G. Dy. Active learning from crowds. In ICML, 2011. 

Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. G. Dy. Modeling
annotator expertise: Learning when everybody knows a bit of something. In ICML, 2010.

F.L. Wauthier and M.I. Jordan. Bayesian bias mitigation for crowdsourcing. In NIPS, 2011. 



Supplementary Material 


“Cheaper and Better: Selecting Good Workers for Crowdsourcing” 2


Proof of Theorem 1: performance guarantee of WMV-linear 

Proof. Without loss of generality, we denote by $\pi$ the prevalence of true labels, i.e., $P(y_j = k) = \pi_k$, $\forall j \in [N]$, $k \in [L]$, where $P$ denotes the probability measure. Note that even in the scenario where $y_j$ is assumed to be fixed instead of random, our analysis and results will still hold with $\pi_k = I(y_j = k)$. Furthermore, we assume the group of workers is $S$ with $|S| = M$.

For WMV-linear, the weights are independent of the data matrix $Z$. The associated weighted majority voting is

$$\hat y_j = \operatorname*{argmax}_{k \in [L]} \sum_{i=1}^{M} \hat\nu_i \, I(Z_{ij} = k),$$

where $\hat\nu_i = L\hat w_i - 1$ and $E[\hat w_i] = w_i$. Thus, we have $E\hat\nu_i = \nu_i = Lw_i - 1$ and $-1 \le \hat\nu_i \le L - 1$.

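Let

$$s_k^{(j)} = \sum_{i=1}^{M} \hat\nu_i \, I(Z_{ij} = k), \quad \forall k \in [L],\ j \in [N] \tag{12}$$

be the aggregated score of the $j$th item on potential label class $k$. Thus the general aggregation rule can be written as $\hat y_j = \operatorname{argmax}_{k \in [L]} s_k^{(j)}$.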
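The aggregation rule above can be illustrated with a minimal sketch. It assumes every selected worker labels every item, that labels are coded as integers 0, ..., L-1, and that ties are broken in favor of the smallest class index; the function name `wmv_linear` and the input layout are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def wmv_linear(labels: Sequence[Sequence[int]],
               w_hat: Sequence[float],
               L: int) -> List[int]:
    """Weighted majority voting with linear weights nu_i = L * w_i - 1.

    labels[i][j] is the label worker i gave to item j, in {0, ..., L-1}
    (an assumed encoding). w_hat[i] is the estimated reliability of worker i.
    """
    weights = [L * w - 1.0 for w in w_hat]        # nu_i for each worker
    n_items = len(labels[0])
    predictions = []
    for j in range(n_items):
        scores = [0.0] * L                        # s_k^{(j)} for each class k
        for i, nu in enumerate(weights):
            scores[labels[i][j]] += nu            # add nu_i * I(Z_ij = k)
        # argmax over classes; max() keeps the first maximizer on ties
        predictions.append(max(range(L), key=lambda k: scores[k]))
    return predictions


# One reliable worker outvotes two weak ones on item 0.
print(wmv_linear([[0, 1], [1, 1], [1, 0]], [0.9, 0.6, 0.6], L=2))
```

In this toy input, plain majority voting would label item 0 as class 1, whereas the linear weights (0.8 versus 0.2 + 0.2) favor the more reliable worker.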

We will frequently discuss conditional probability, expectation and variance conditioned on the event $\{y_j = k\}$. Where no ambiguity arises, we define:

$$P_k(\,\cdot\,) = P(\,\cdot \mid y_j = k) \tag{13}$$

$$E_k[\,\cdot\,] = E[\,\cdot \mid y_j = k] \tag{14}$$


Note that

$$E_k\big[s_l^{(j)}\big] = \sum_{i=1}^{M} \nu_i \left( w_i \, I(l = k) + \frac{1 - w_i}{L - 1} \, I(l \ne k) \right), \quad \forall l, k \in [L]. \tag{15}$$

First of all, we expand the probability of labeling the $j$-th item incorrectly in terms of the conditional probabilities:

$$P(\hat y_j \ne y_j) = \sum_{k \in [L]} P(y_j = k)\, P(\hat y_j \ne k \mid y_j = k) = \sum_{k \in [L]} \pi_k \, P_k(\hat y_j \ne k). \tag{16}$$

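Our main focus in this proof is to bound the term $P_k(\hat y_j \ne k)$. Our approach is based on the following relations between events:

$$\bigcup_{l \in [L],\, l \ne k} \big\{ s_l^{(j)} > s_k^{(j)} \big\} \subseteq \{ \hat y_j \ne k \} \subseteq \bigcup_{l \in [L],\, l \ne k} \big\{ s_l^{(j)} \ge s_k^{(j)} \big\}. \tag{17}$$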

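We want to provide an upper bound for $P(\hat y_j \ne y_j)$. Note that

$$P_k(\hat y_j \ne k) \le P_k\Big( \bigcup_{l \in [L],\, l \ne k} \big\{ s_l^{(j)} \ge s_k^{(j)} \big\} \Big) \le \sum_{l \in [L],\, l \ne k} P_k\big( s_l^{(j)} \ge s_k^{(j)} \big). \tag{18}$$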

2 The equation numbers in this supplementary continue with the ones in the main paper. 
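With $s_k^{(j)}$ defined as in (12), and when $l \ne k$, we define

$$\xi_{kl}^{(i)} = \hat\nu_i \big( I(Z_{ij} = l) - I(Z_{ij} = k) \big), \tag{19}$$

so that $s_l^{(j)} - s_k^{(j)} = \sum_{i=1}^{M} \xi_{kl}^{(i)}$. Then

$$E_k\, \xi_{kl}^{(i)} = E\hat\nu_i \cdot \frac{1 - Lw_i}{L - 1} = -\frac{1}{L - 1} (Lw_i - 1)^2, \tag{20}$$

$$\Lambda_{kl}^{(j)} = E_k\big[ s_k^{(j)} - s_l^{(j)} \big] = -\sum_{i=1}^{M} E_k\, \xi_{kl}^{(i)} = \frac{1}{L - 1} \sum_{i=1}^{M} (Lw_i - 1)^2. \tag{21}$$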

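We have

$$\begin{aligned}
P_k\big( s_l^{(j)} \ge s_k^{(j)} \big)
&= P_k\Big( \sum_{i=1}^{M} \hat\nu_i \big( I(Z_{ij} = l) - I(Z_{ij} = k) \big) \ge 0 \Big) \\
&= P_k\Big( \sum_{i=1}^{M} \xi_{kl}^{(i)} - \sum_{i=1}^{M} E_k\, \xi_{kl}^{(i)} \ge -\sum_{i=1}^{M} E_k\, \xi_{kl}^{(i)} \Big) \\
&= P_k\Big( \sum_{i=1}^{M} \xi_{kl}^{(i)} - \sum_{i=1}^{M} E_k\, \xi_{kl}^{(i)} \ge \Lambda_{kl}^{(j)} \Big).
\end{aligned} \tag{22}$$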


Note that $\{\xi_{kl}^{(i)}\}_{i \in [M]}$ are conditionally independent given $\{y_j = k\}$, and they are bounded because the voting weights $\{\hat\nu_i\}_{i \in [M]}$ are bounded. Therefore, we can apply the Hoeffding concentration inequality to further bound $P_k\big( s_l^{(j)} \ge s_k^{(j)} \big)$.

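Apparently, $-1 \le \xi_{kl}^{(i)} \le L - 1$. Note that $\Lambda_{kl}^{(j)} \ge 0$; by applying the Hoeffding inequality to (22),

$$\begin{aligned}
P_k\big( s_l^{(j)} \ge s_k^{(j)} \big)
&= P_k\Big( \sum_{i=1}^{M} \xi_{kl}^{(i)} - \sum_{i=1}^{M} E_k\, \xi_{kl}^{(i)} \ge \Lambda_{kl}^{(j)} \Big) \\
&\le \exp\left( -\frac{2 \big( \Lambda_{kl}^{(j)} \big)^2}{\sum_{i=1}^{M} \big[ (L - 1) - (-1) \big]^2} \right) \\
&= \exp\left( -\frac{2 \big( \Lambda_{kl}^{(j)} \big)^2}{L^2 M} \right) = e^{-2t^2},
\end{aligned}$$

where $t = \dfrac{\sum_{i=1}^{M} (Lw_i - 1)^2}{L(L - 1)\sqrt{M}}$.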

The right-hand side of the last inequality does not depend on $k$, $l$ or $j$. It follows directly that

$$P_k(\hat y_j \ne k) \le \sum_{l \in [L],\, l \ne k} P_k\big( s_l^{(j)} \ge s_k^{(j)} \big) \le (L - 1)\, e^{-2t^2}.$$


Furthermore, we have

$$P(\hat y_j \ne y_j) = \sum_{k \in [L]} \pi_k \, P_k(\hat y_j \ne k) \tag{23}$$

$$\le (L - 1)\, e^{-2t^2} \underbrace{\sum_{k \in [L]} \pi_k}_{=1} = e^{-2t^2 + \ln(L - 1)}. \tag{24}$$
The bound $e^{-2t^2 + \ln(L - 1)}$ does not depend on $j$, thus it is also a valid bound for the mean error rate. That is to say

$$\frac{1}{N} \sum_{j=1}^{N} P(\hat y_j \ne y_j) \le e^{-2t^2 + \ln(L - 1)}.$$

Note that $t = \dfrac{F(S)}{L(L - 1)}$; thus we have proved the desired result.

□
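As a worked illustration of the bound just derived, the following sketch evaluates $t$ and $e^{-2t^2 + \ln(L-1)}$ for a given list of worker reliabilities. The function name and the example inputs are hypothetical; the formula is the one stated in the proof. For $L > 2$ the bound can exceed 1 when the workers are weak, in which case it carries no information.

```python
import math
from typing import Sequence


def wmv_linear_error_bound(w: Sequence[float], L: int) -> float:
    """Upper bound exp(-2 t^2 + ln(L - 1)) on the mean error rate.

    t = sum_i (L * w_i - 1)^2 / (L * (L - 1) * sqrt(M)), with M = len(w).
    """
    M = len(w)
    t = sum((L * wi - 1.0) ** 2 for wi in w) / (L * (L - 1) * math.sqrt(M))
    return math.exp(-2.0 * t * t + math.log(L - 1))


# Binary labels: compare a small strong crowd with a larger mixed crowd.
strong = [0.9] * 10
mixed = [0.9] * 10 + [0.55] * 30
print(wmv_linear_error_bound(strong, L=2))
print(wmv_linear_error_bound(mixed, L=2))
```

Adding the 30 near-random workers increases the numerator of $t$ only slightly but doubles $\sqrt{M}$, so the bound for the mixed crowd is looser than for the small strong crowd.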


Proof of Lemma 2: unbiasedness of $\hat F(S)$


Proof. Assume $|S| = k$. Let $\hat F_{\text{plug}}(S)$ be the plug-in estimator for $F(S)$, i.e.,

$$\hat F_{\text{plug}}(S) = \frac{1}{\sqrt{k}} \sum_{i \in S} (L\hat w_i - 1)^2.$$

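First we show that $\hat F_{\text{plug}}(S)$ is a biased estimate of $F(S)$.

$$\begin{aligned}
E\big[\hat F_{\text{plug}}(S)\big]
&= \frac{1}{\sqrt{k}} \sum_{i \in S} \big( L^2 E[\hat w_i^2] - 2L\, E[\hat w_i] + 1 \big) \\
&= \frac{1}{\sqrt{k}} \sum_{i \in S} \big( L^2 ( \mathrm{var}(\hat w_i) + w_i^2 ) - 2Lw_i + 1 \big) \\
&= \frac{1}{\sqrt{k}} \sum_{i \in S} \big( (Lw_i - 1)^2 + L^2\, \mathrm{var}(\hat w_i) \big) \\
&= F(S) + \frac{1}{\sqrt{k}} \sum_{i \in S} L^2\, \mathrm{var}(\hat w_i).
\end{aligned}$$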

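Note that $E\hat w_i = w_i$ and $E[\widehat{\mathrm{var}}(\hat w_i)] = \mathrm{var}(\hat w_i)$, thus we can move terms around to construct an unbiased estimate of $F(S)$ based on $\hat F_{\text{plug}}(S)$:

$$F(S) = E\, \hat F_{\text{plug}}(S) - \frac{1}{\sqrt{k}} \sum_{i \in S} L^2\, \mathrm{var}(\hat w_i),$$

which leads to an unbiased estimate of $F(S)$ as follows.

$$\hat F(S) = \hat F_{\text{plug}}(S) - \frac{1}{\sqrt{k}} \sum_{i \in S} L^2\, \widehat{\mathrm{var}}(\hat w_i) = \frac{1}{\sqrt{k}} \sum_{i \in S} \big( (L\hat w_i - 1)^2 - L^2\, \widehat{\mathrm{var}}(\hat w_i) \big),$$

which is the same form as (7), and $E[\hat F(S)] = F(S)$.

□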
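The bias of the plug-in term and its removal can be checked exactly for a single worker. The sketch below assumes, as an illustration only, that $\hat w$ is the fraction of $n$ independent control questions answered correctly (so $n\hat w$ is binomial) and that $\widehat{\mathrm{var}}(\hat w) = \hat w(1 - \hat w)/(n - 1)$; the function name is hypothetical. It enumerates all outcomes rather than sampling, so no randomness is involved.

```python
from math import comb
from typing import Tuple


def expected_terms(w: float, n: int, L: int) -> Tuple[float, float]:
    """Exact expectations of the plug-in and bias-corrected summands.

    Assumes n * w_hat ~ Binomial(n, w) and var_hat = w_hat * (1 - w_hat) / (n - 1).
    Returns (E[(L w_hat - 1)^2], E[(L w_hat - 1)^2 - L^2 var_hat]).
    """
    e_plug = 0.0
    e_corrected = 0.0
    for c in range(n + 1):                       # c = number of correct answers
        p = comb(n, c) * w ** c * (1 - w) ** (n - c)
        w_hat = c / n
        plug = (L * w_hat - 1) ** 2
        var_hat = w_hat * (1 - w_hat) / (n - 1)
        e_plug += p * plug
        e_corrected += p * (plug - L ** 2 * var_hat)
    return e_plug, e_corrected


e_plug, e_corrected = expected_terms(w=0.7, n=10, L=3)
target = (3 * 0.7 - 1) ** 2                      # (L w - 1)^2
print(e_plug - target, e_corrected - target)
```

Under these assumptions the plug-in expectation exceeds $(Lw - 1)^2$ by $L^2 w(1 - w)/n$, while the corrected summand matches it up to floating-point error.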


Proof of Theorem 3: symmetric confidence interval 


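Proof. Similar to the proof of Lemma 2, we assume $|S| = k$. With $\hat w_i$ and $\widehat{\mathrm{var}}(\hat w_i)$ defined as in (10), it is straightforward to show that $E\hat w_i = w_i$ and $E[\widehat{\mathrm{var}}(\hat w_i)] = \frac{w_i(1 - w_i)}{n} = \mathrm{var}(\hat w_i)$; then by Lemma 2 the corresponding unbiased estimator of $F(S)$ is

$$\begin{aligned}
\hat F(S)
&= \frac{1}{\sqrt{k}} \sum_{i \in S} \left[ (L\hat w_i - 1)^2 - \frac{L^2 \hat w_i (1 - \hat w_i)}{n - 1} \right] \\
&= \frac{n}{(n - 1)\sqrt{k}} \sum_{i \in S} \left[ \left( L\hat w_i - 1 - \frac{L - 2}{2n} \right)^2 - \left( \frac{(L - 2)^2}{4n^2} + \frac{L - 1}{n} \right) \right],
\end{aligned}$$

therefore the summand $G$ for worker $i$ has two equivalent forms:

$$G = (L\hat w_i - 1)^2 - \frac{L^2 \hat w_i (1 - \hat w_i)}{n - 1} \tag{25}$$

and

$$G = \frac{n}{n - 1} \left[ \left( L\hat w_i - 1 - \frac{L - 2}{2n} \right)^2 - \left( \frac{(L - 2)^2}{4n^2} + \frac{L - 1}{n} \right) \right]. \tag{26}$$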


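Next, we prove the confidence interval of $F(S)$ based on the form (26) of $G$. We define random variables $\{X_i\}_{i \in S}$ as

$$X_i = \left( L\hat w_i - 1 - \frac{L - 2}{2n} \right)^2 - A, \quad \text{where } A = \frac{(L - 2)^2}{4n^2} + \frac{L - 1}{n}.$$

Then

$$\hat F(S) = \frac{n}{(n - 1)\sqrt{k}} \sum_{i \in S} X_i.$$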

Note that $\{X_i\}_{i \in S}$ are a collection of independent random variables, $-A \le X_i \le \left( L - 1 - \frac{L - 2}{2n} \right)^2 - A$, and $E\, \hat F(S) = F(S)$. We can apply the Hoeffding inequality to bound the following probability.


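$$\begin{aligned}
P\left( \big| \hat F(S) - F(S) \big| < \frac{n}{(n - 1)\sqrt{k}}\, \beta \right)
&= P\left( \frac{n}{(n - 1)\sqrt{k}} \Big| \sum_{i \in S} (X_i - E X_i) \Big| < \frac{n}{(n - 1)\sqrt{k}}\, \beta \right) \\
&= P\left( \Big| \sum_{i \in S} X_i - E\Big[ \sum_{i \in S} X_i \Big] \Big| < \beta \right) \\
&\ge 1 - 2\exp\left( -\frac{2\beta^2}{k \left( L - 1 - \frac{L - 2}{2n} \right)^4} \right) = 1 - 2e^{-2\alpha^2},
\end{aligned} \tag{27}$$

where $\beta > 0$ is arbitrary, $\alpha = \dfrac{\beta}{\left( L - 1 - \frac{L - 2}{2n} \right)^2 \sqrt{k}}$, and the inequality is due to the Hoeffding bound. Meanwhile,

$$\begin{aligned}
P\left( \big| \hat F(S) - F(S) \big| < \frac{n}{(n - 1)\sqrt{k}}\, \beta \right)
&= P\left( \big| \hat F(S) - F(S) \big| < \frac{n \left( L - 1 - \frac{L - 2}{2n} \right)^2}{n - 1}\, \alpha \right) \\
&\le P\left( \big| \hat F(S) - F(S) \big| < \frac{n (L - 1)^2}{n - 1}\, \alpha \right),
\end{aligned} \tag{28}$$

which implies that $\left[ \hat F(S) - \frac{n(L - 1)^2}{n - 1}\alpha,\ \hat F(S) + \frac{n(L - 1)^2}{n - 1}\alpha \right]$ covers $F(S)$ with probability at least $1 - 2e^{-2\alpha^2}$.

□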
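The two forms of the estimator and the resulting interval can be checked numerically. This is a sketch under the same assumptions as the proof ($n$ control questions per worker, $\widehat{\mathrm{var}}(\hat w_i) = \hat w_i(1 - \hat w_i)/(n - 1)$); the function names and the parameter `delta` (the allowed failure probability, with $\alpha$ chosen so that $2e^{-2\alpha^2} = \delta$) are introduced here for illustration.

```python
import math
from typing import Sequence, Tuple


def f_hat(w_hat: Sequence[float], n: int, L: int) -> float:
    """Unbiased estimate of F(S), summand written as in form (25)."""
    k = len(w_hat)
    total = sum((L * w - 1) ** 2 - L ** 2 * w * (1 - w) / (n - 1) for w in w_hat)
    return total / math.sqrt(k)


def f_hat_completed_square(w_hat: Sequence[float], n: int, L: int) -> float:
    """Same estimate, summand written as in form (26)."""
    k = len(w_hat)
    shift = (L - 2) / (2 * n)
    A = (L - 2) ** 2 / (4 * n ** 2) + (L - 1) / n
    total = sum((L * w - 1 - shift) ** 2 - A for w in w_hat)
    return n / ((n - 1) * math.sqrt(k)) * total


def confidence_interval(w_hat: Sequence[float], n: int, L: int,
                        delta: float) -> Tuple[float, float]:
    """Symmetric interval covering F(S) with probability at least 1 - delta."""
    alpha = math.sqrt(math.log(2.0 / delta) / 2.0)   # 2 * exp(-2 alpha^2) = delta
    half_width = n * (L - 1) ** 2 * alpha / (n - 1)
    center = f_hat(w_hat, n, L)
    return center - half_width, center + half_width


w_hat = [0.9, 0.8, 0.6]
print(f_hat(w_hat, n=20, L=3), f_hat_completed_square(w_hat, n=20, L=3))
print(confidence_interval(w_hat, n=20, L=3, delta=0.05))
```

The half-width depends on $n$, $L$ and $\alpha$ but not on $k$, which reflects the $1/\sqrt{k}$ normalization already built into $\hat F(S)$.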


Proof of Theorem 4: the global optimality of worker selection algorithm 


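Proof. Let $x_i = (L\hat w_i - 1)^2 - L^2\, \widehat{\mathrm{var}}(\hat w_i)$; then the optimization problem (11) can be written as

$$\operatorname*{argmax}_{S \subseteq \Omega} \hat F(S) \quad \text{s.t. } |S| \le K, \tag{29}$$

where

$$\hat F(S) = \frac{1}{\sqrt{|S|}} \sum_{i \in S} x_i.$$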
Note that $\{\hat w_i\}_{i \in \Omega}$ are given in these optimization problems, thus we do not treat $x_i$ as random. Problem (11) (i.e., (29)) is a deterministic combinatorial problem. In this proof, we show that the output of Algorithm 1 achieves the global maximum of problem (29).

The worker selection problem is to select a worker set, denoted by $S^*$, such that $\hat F(S^*) \ge \hat F(S)$ for any set of workers $S \subseteq \Omega$ with $|S| \le K$. Let $\sigma$ be a permutation of $\Omega = \{1, 2, \cdots, M\}$ such that $x_{\sigma(1)} \ge x_{\sigma(2)} \ge \cdots \ge x_{\sigma(M)}$.

We want to show that given any globally optimal solution $S^*$ of problem (29), which has cardinality $|S^*| = k^*$, we have $\hat F(S^*) = \hat F(\{\sigma(1), \sigma(2), \cdots, \sigma(k^*)\})$.

To see this, let $S' = \{\sigma(1), \sigma(2), \cdots, \sigma(k^*)\}$, and assume for contradiction that $\hat F(S^*) > \hat F(S')$. Since the value of the function $\hat F(S)$ depends only on the cardinality of $S$ and on $\{x_i\}_{i \in S}$, the configuration$^{3}$ of values $\{x_i\}_{i \in S^*}$ is not equal to $\{x_i\}_{i \in S'}$. This further implies that there exist $i \in S^* \setminus S'$ and $j \in S' \setminus S^*$ such that $x_i \ne x_j$. Since $i \notin S'$ and $S'$ consists of the workers with the top $k^*$ $x$-values, we have $x_i < x_j$. Therefore, replacing $i$ in $S^*$ with $j$ will increase the value of $\hat F$, i.e., $\hat F((S^* \setminus \{i\}) \cup \{j\}) > \hat F(S^*)$. This contradicts the fact that $\hat F(S^*)$ is the global optimum. Thus we conclude that $\hat F(S^*) = \hat F(S')$.
The analysis above implies that if we know the cardinality $k^*$ of the globally optimal solution, then the top $k^*$ workers in terms of $x$-values will be a global optimum of problem (29), although it might not be the unique one. Based on the fact that the cardinality of $S^*$ has to be one of the values in $\{1, 2, \cdots, \min(K, M)\}$, we can compute the value of $\hat F(\{\sigma(1), \sigma(2), \cdots, \sigma(k)\})$ with $k$ from 1 to $\min(K, M)$. Then the maximum of the resulting $\hat F$ values has to be a global optimum of problem (29), and thus the corresponding worker set is a global optimum of problem (11). Algorithm 1 follows exactly the procedure described above, therefore it outputs a globally optimal worker set.

□
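The procedure described in the proof (sort workers by $x_i$, evaluate every prefix of length $k \le \min(K, M)$, keep the best) can be sketched as follows. This is an illustration of that description, not the authors' implementation; the function names are hypothetical, and the brute-force routine is included only to check the result on a small example. Using a strict comparison keeps the smallest prefix among ties, which is a choice made here.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def select_workers(x: Sequence[float], K: int) -> Tuple[List[int], float]:
    """Maximize sum_{i in S} x_i / sqrt(|S|) subject to 1 <= |S| <= K."""
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    best_set: List[int] = []
    best_val = -math.inf
    prefix_sum = 0.0
    for k in range(1, min(K, len(x)) + 1):
        prefix_sum += x[order[k - 1]]            # sum of the top-k x-values
        val = prefix_sum / math.sqrt(k)
        if val > best_val:                       # strict: prefer fewer workers on ties
            best_set, best_val = order[:k], val
    return best_set, best_val


def brute_force_value(x: Sequence[float], K: int) -> float:
    """Exhaustive search over all non-empty subsets of size at most K."""
    return max(sum(x[i] for i in S) / math.sqrt(len(S))
               for k in range(1, min(K, len(x)) + 1)
               for S in combinations(range(len(x)), k))


x = [0.9, 0.1, 0.5, -0.2, 0.7]
chosen, value = select_workers(x, K=4)
print(sorted(chosen), value, brute_force_value(x, K=4))
```

With these example scores the best prefix has three workers even though the budget allows four, since adding the worker with $x_i = 0.1$ lowers the normalized score. Sorting dominates the cost, so the sketch runs in $O(M \log M)$ time, compared with the exponential cost of the exhaustive check.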

As mentioned in the remark of Theorem 4, we can show that $S^*$ also solves the following multi-objective optimization problem that simultaneously maximizes the score $\hat F(S)$ and minimizes the number $|S|$ of workers actually deployed.

Theorem 5. Consider a multiple-objective optimization problem,

$$\operatorname*{argmax}_{S \subseteq \Omega} \big( \hat F(S),\ -|S| \big), \quad \text{s.t. } |S| \le K,$$

then $S^*$ is its Pareto optimal solution.

Proof. By Theorem 4, suppose $S^*$ is the global optimum of problem (11) and $|S^*| \le K$; then there is no other set $S$ such that $|S| \le K$ and $\hat F(S) > \hat F(S^*)$. This implies that among the sets with cardinality no more than $K$, there is no other set that could improve $\hat F(S)$. Thus $S^*$ is Pareto optimal$^{4}$ according to its definition in the context of multiple objective optimization. □


3 Here, we use configuration to denote sets that allow duplicates of values such as {1, 1, 1, 2, 3, 3}.

4 http://en.wikipedia.org/wiki/Multi-objective-optimization 
[Figure A.5: two panels, (a) and (b). x-axis: K, maximum #workers per item. y-axis: Avg #workers used.]

Figure A.5: The number of workers the algorithms actually used for each K on the two real-world datasets in Figure 4 (Section 4.2): (a) The disambiguity dataset, (b) The bluebird dataset.